Exploratory Data Analysis (EDA)
In [ ]:
import json
import math
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight
# Ignore warnings
warnings.filterwarnings("ignore")
Dataset from synthetic traffic
In [2]:
data = pd.read_csv("../synthetic_data.csv")
data.head()
Out[2]:
| timestamp | amf_session_value | bearers_active_value | fivegs_amffunction_amf_authreject_value | fivegs_amffunction_amf_authreq_value | fivegs_amffunction_mm_confupdate_value | fivegs_amffunction_mm_confupdatesucc_value | fivegs_amffunction_mm_paging5greq_value | fivegs_amffunction_mm_paging5gsucc_value | fivegs_amffunction_rm_regemergreq_value | ... | process_resident_memory_bytes_value | process_start_time_seconds_value | process_virtual_memory_bytes_value | process_virtual_memory_max_bytes_value | ran_ue_value | s5c_rx_createsession_value | s5c_rx_parse_failed_value | application | log_type | current_uc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-04-11 14:41:57 | 4.0 | 4.0 | 0.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 52657356.8 | 3.644742e+08 | 1.151508e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc6 |
| 1 | 2025-04-11 14:41:58 | 4.0 | 4.0 | 0.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 52657356.8 | 3.644742e+08 | 1.151508e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc6 |
| 2 | 2025-04-11 14:41:59 | 4.0 | 4.0 | 0.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 52657356.8 | 3.644742e+08 | 1.151508e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc6 |
| 3 | 2025-04-11 14:42:00 | 4.0 | 4.0 | 0.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 52657356.8 | 3.644742e+08 | 1.151508e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc6 |
| 4 | 2025-04-11 14:42:01 | 4.0 | 4.0 | 0.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 52657356.8 | 3.644742e+08 | 1.151508e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc6 |
5 rows × 58 columns
Missing values, data types, duplicates, and descriptive statistics
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 43810 entries, 0 to 43809 Data columns (total 58 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 timestamp 43810 non-null object 1 amf_session_value 43810 non-null float64 2 bearers_active_value 43810 non-null float64 3 fivegs_amffunction_amf_authreject_value 43810 non-null float64 4 fivegs_amffunction_amf_authreq_value 43810 non-null float64 5 fivegs_amffunction_mm_confupdate_value 43810 non-null float64 6 fivegs_amffunction_mm_confupdatesucc_value 43810 non-null float64 7 fivegs_amffunction_mm_paging5greq_value 43810 non-null float64 8 fivegs_amffunction_mm_paging5gsucc_value 43810 non-null float64 9 fivegs_amffunction_rm_regemergreq_value 43810 non-null float64 10 fivegs_amffunction_rm_regemergsucc_value 43810 non-null float64 11 fivegs_amffunction_rm_reginitreq_value 43810 non-null float64 12 fivegs_amffunction_rm_reginitsucc_value 43810 non-null float64 13 fivegs_amffunction_rm_registeredsubnbr_value 43810 non-null float64 14 fivegs_amffunction_rm_regmobreq_value 43810 non-null float64 15 fivegs_amffunction_rm_regmobsucc_value 43810 non-null float64 16 fivegs_amffunction_rm_regperiodreq_value 43810 non-null float64 17 fivegs_amffunction_rm_regperiodsucc_value 43810 non-null float64 18 fivegs_ep_n3_gtp_indatapktn3upf_value 43810 non-null float64 19 fivegs_ep_n3_gtp_outdatapktn3upf_value 43810 non-null float64 20 fivegs_pcffunction_pa_policyamassoreq_value 43810 non-null float64 21 fivegs_pcffunction_pa_policyamassosucc_value 43810 non-null float64 22 fivegs_pcffunction_pa_policysmassoreq_value 43810 non-null float64 23 fivegs_pcffunction_pa_policysmassosucc_value 43810 non-null float64 24 fivegs_pcffunction_pa_sessionnbr_value 43810 non-null float64 25 fivegs_smffunction_sm_n4sessionestabreq_value 43810 non-null float64 26 fivegs_smffunction_sm_n4sessionreport_value 43810 non-null float64 27 fivegs_smffunction_sm_n4sessionreportsucc_value 43810 non-null float64 28 
fivegs_smffunction_sm_pdusessioncreationreq_value 43809 non-null float64 29 fivegs_smffunction_sm_pdusessioncreationsucc_value 43809 non-null float64 30 fivegs_smffunction_sm_qos_flow_nbr_value 43791 non-null float64 31 fivegs_smffunction_sm_sessionnbr_value 43809 non-null float64 32 fivegs_upffunction_sm_n4sessionestabreq_value 43810 non-null float64 33 fivegs_upffunction_sm_n4sessionreport_value 43810 non-null float64 34 fivegs_upffunction_sm_n4sessionreportsucc_value 43810 non-null float64 35 fivegs_upffunction_upf_qosflows_value 43810 non-null float64 36 fivegs_upffunction_upf_sessionnbr_value 43810 non-null float64 37 gn_rx_createpdpcontextreq_value 43810 non-null float64 38 gn_rx_deletepdpcontextreq_value 43810 non-null float64 39 gn_rx_parse_failed_value 43810 non-null float64 40 gnb_value 43810 non-null float64 41 gtp1_pdpctxs_active_value 43810 non-null float64 42 gtp2_sessions_active_value 43810 non-null float64 43 gtp_new_node_failed_value 43810 non-null float64 44 gtp_peers_active_value 43810 non-null float64 45 process_cpu_seconds_total_value 43810 non-null float64 46 process_max_fds_value 43810 non-null float64 47 process_open_fds_value 43810 non-null float64 48 process_resident_memory_bytes_value 43810 non-null float64 49 process_start_time_seconds_value 43810 non-null float64 50 process_virtual_memory_bytes_value 43810 non-null float64 51 process_virtual_memory_max_bytes_value 43810 non-null float64 52 ran_ue_value 43810 non-null float64 53 s5c_rx_createsession_value 43810 non-null float64 54 s5c_rx_parse_failed_value 43810 non-null float64 55 application 43810 non-null object 56 log_type 43810 non-null object 57 current_uc 43810 non-null object dtypes: float64(54), object(4) memory usage: 19.4+ MB
In [4]:
data.describe(include='all')
Out[4]:
| timestamp | amf_session_value | bearers_active_value | fivegs_amffunction_amf_authreject_value | fivegs_amffunction_amf_authreq_value | fivegs_amffunction_mm_confupdate_value | fivegs_amffunction_mm_confupdatesucc_value | fivegs_amffunction_mm_paging5greq_value | fivegs_amffunction_mm_paging5gsucc_value | fivegs_amffunction_rm_regemergreq_value | ... | process_resident_memory_bytes_value | process_start_time_seconds_value | process_virtual_memory_bytes_value | process_virtual_memory_max_bytes_value | ran_ue_value | s5c_rx_createsession_value | s5c_rx_parse_failed_value | application | log_type | current_uc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 43810 | 43810.000000 | 43810.000000 | 43810.0 | 43810.000000 | 43810.000000 | 43810.0 | 43810.0 | 43810.0 | 43810.0 | ... | 4.381000e+04 | 4.381000e+04 | 4.381000e+04 | 43810.0 | 43810.000000 | 43810.0 | 43810.0 | 43810 | 43810 | 43810 |
| unique | 29751 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 5 | 6 |
| top | 2025-04-12 16:25:54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | uc4 |
| freq | 140 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29015 | 29015 | 11619 |
| mean | NaN | 3.924492 | 3.918238 | 0.0 | 104.807213 | 104.788815 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.147382e+07 | 3.677355e+08 | 1.150896e+09 | -1.0 | 1.934239 | 0.0 | 0.0 | NaN | NaN | NaN |
| std | NaN | 0.432750 | 0.442682 | 0.0 | 64.226221 | 64.219504 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.863619e+06 | 1.810116e+06 | 1.110293e+05 | 0.0 | 1.737712 | 0.0 | 0.0 | NaN | NaN | NaN |
| min | NaN | 0.000000 | 0.000000 | 0.0 | 1.000000 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 4.583014e+07 | 3.644742e+08 | 1.150873e+09 | -1.0 | 0.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| 25% | NaN | 4.000000 | 4.000000 | 0.0 | 39.000000 | 38.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.207409e+07 | 3.660509e+08 | 1.150873e+09 | -1.0 | 0.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| 50% | NaN | 4.000000 | 4.000000 | 0.0 | 113.000000 | 113.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.221581e+07 | 3.660509e+08 | 1.150873e+09 | -1.0 | 1.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| 75% | NaN | 4.000000 | 4.000000 | 0.0 | 159.000000 | 159.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.231002e+07 | 3.695255e+08 | 1.150873e+09 | -1.0 | 4.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| max | NaN | 4.000000 | 4.000000 | 0.0 | 204.000000 | 204.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.268357e+07 | 3.695255e+08 | 1.151508e+09 | -1.0 | 4.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
11 rows × 58 columns
In [5]:
data.isnull().sum()[data.isnull().sum() > 0]
Out[5]:
fivegs_smffunction_sm_pdusessioncreationreq_value      1
fivegs_smffunction_sm_pdusessioncreationsucc_value     1
fivegs_smffunction_sm_qos_flow_nbr_value              19
fivegs_smffunction_sm_sessionnbr_value                 1
dtype: int64
In [6]:
data.isnull().sum() / len(data) * 100
Out[6]:
timestamp 0.000000 amf_session_value 0.000000 bearers_active_value 0.000000 fivegs_amffunction_amf_authreject_value 0.000000 fivegs_amffunction_amf_authreq_value 0.000000 fivegs_amffunction_mm_confupdate_value 0.000000 fivegs_amffunction_mm_confupdatesucc_value 0.000000 fivegs_amffunction_mm_paging5greq_value 0.000000 fivegs_amffunction_mm_paging5gsucc_value 0.000000 fivegs_amffunction_rm_regemergreq_value 0.000000 fivegs_amffunction_rm_regemergsucc_value 0.000000 fivegs_amffunction_rm_reginitreq_value 0.000000 fivegs_amffunction_rm_reginitsucc_value 0.000000 fivegs_amffunction_rm_registeredsubnbr_value 0.000000 fivegs_amffunction_rm_regmobreq_value 0.000000 fivegs_amffunction_rm_regmobsucc_value 0.000000 fivegs_amffunction_rm_regperiodreq_value 0.000000 fivegs_amffunction_rm_regperiodsucc_value 0.000000 fivegs_ep_n3_gtp_indatapktn3upf_value 0.000000 fivegs_ep_n3_gtp_outdatapktn3upf_value 0.000000 fivegs_pcffunction_pa_policyamassoreq_value 0.000000 fivegs_pcffunction_pa_policyamassosucc_value 0.000000 fivegs_pcffunction_pa_policysmassoreq_value 0.000000 fivegs_pcffunction_pa_policysmassosucc_value 0.000000 fivegs_pcffunction_pa_sessionnbr_value 0.000000 fivegs_smffunction_sm_n4sessionestabreq_value 0.000000 fivegs_smffunction_sm_n4sessionreport_value 0.000000 fivegs_smffunction_sm_n4sessionreportsucc_value 0.000000 fivegs_smffunction_sm_pdusessioncreationreq_value 0.002283 fivegs_smffunction_sm_pdusessioncreationsucc_value 0.002283 fivegs_smffunction_sm_qos_flow_nbr_value 0.043369 fivegs_smffunction_sm_sessionnbr_value 0.002283 fivegs_upffunction_sm_n4sessionestabreq_value 0.000000 fivegs_upffunction_sm_n4sessionreport_value 0.000000 fivegs_upffunction_sm_n4sessionreportsucc_value 0.000000 fivegs_upffunction_upf_qosflows_value 0.000000 fivegs_upffunction_upf_sessionnbr_value 0.000000 gn_rx_createpdpcontextreq_value 0.000000 gn_rx_deletepdpcontextreq_value 0.000000 gn_rx_parse_failed_value 0.000000 gnb_value 0.000000 gtp1_pdpctxs_active_value 0.000000 
gtp2_sessions_active_value 0.000000 gtp_new_node_failed_value 0.000000 gtp_peers_active_value 0.000000 process_cpu_seconds_total_value 0.000000 process_max_fds_value 0.000000 process_open_fds_value 0.000000 process_resident_memory_bytes_value 0.000000 process_start_time_seconds_value 0.000000 process_virtual_memory_bytes_value 0.000000 process_virtual_memory_max_bytes_value 0.000000 ran_ue_value 0.000000 s5c_rx_createsession_value 0.000000 s5c_rx_parse_failed_value 0.000000 application 0.000000 log_type 0.000000 current_uc 0.000000 dtype: float64
Missing values: Four columns contain missing values (at most 19 of 43810 rows). They must be handled before the data are used for machine learning.
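The two cells above report counts and percentages separately. A minimal sketch of combining them into one compact table is shown below; the toy DataFrame and its column names are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "metric_a": [1.0, np.nan, 3.0, 4.0],
    "metric_b": [0.0, 0.0, 0.0, 0.0],
    "metric_c": [np.nan, np.nan, 2.0, 2.0],
})

# Count and percentage of missing values per column, side by side.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})

# Keep only the columns that actually have gaps.
missing = missing[missing["n_missing"] > 0]
print(missing)
```

Filtering on the count keeps the output short even for a 58-column frame.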
In [7]:
data.nunique()[data.nunique() > 1].apply(lambda x: f"{x:<50}{data.nunique()[data.nunique() > 1].index[data.nunique()[data.nunique() > 1] == x][0]}")
Out[7]:
timestamp 29751 ... amf_session_value 5 ... bearers_active_value 5 ... fivegs_amffunction_amf_authreq_value 204 ... fivegs_amffunction_mm_confupdate_value 199 ... fivegs_amffunction_rm_reginitreq_value 228 ... fivegs_amffunction_rm_reginitsucc_value 199 ... fivegs_amffunction_rm_registeredsubnbr_value 5 ... fivegs_pcffunction_pa_policyamassoreq_value 202 ... fivegs_pcffunction_pa_policyamassosucc_value 202 ... fivegs_pcffunction_pa_policysmassoreq_value 203 ... fivegs_pcffunction_pa_policysmassosucc_value 203 ... fivegs_pcffunction_pa_sessionnbr_value 5 ... fivegs_smffunction_sm_pdusessioncreationreq_value 204 ... fivegs_smffunction_sm_pdusessioncreationsucc_value 204 ... fivegs_smffunction_sm_qos_flow_nbr_value 203 ... fivegs_smffunction_sm_sessionnbr_value 5 ... fivegs_upffunction_sm_n4sessionestabreq_value 201 ... fivegs_upffunction_upf_qosflows_value 5 ... fivegs_upffunction_upf_sessionnbr_value 5 ... process_cpu_seconds_total_value 4373 ... process_open_fds_value 7 ... process_resident_memory_bytes_value 208 ... process_start_time_seconds_value 3 ... process_virtual_memory_bytes_value 4 ... ran_ue_value 5 ... application 6 ... log_type 5 ... current_uc 6 ... dtype: object
In [8]:
data.duplicated().sum()
Out[8]:
np.int64(9739)
Duplicates: Duplicate rows in the data mean that the state did not change between individual measurements.
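Called without arguments, duplicated() compares every column, the timestamp included. To count rows where only the measured state repeats, the timestamp can be left out of the comparison. The sketch below assumes a made-up three-column frame; the column names are placeholders.

```python
import pandas as pd

# Toy frame: the same state observed at several timestamps (values are made up).
toy = pd.DataFrame({
    "timestamp": ["t0", "t1", "t1", "t2"],
    "sessions": [4.0, 4.0, 4.0, 4.0],
    "label": ["uc6", "uc6", "uc6", "uc6"],
})

# Rows identical in every column, timestamp included.
exact_dupes = toy.duplicated().sum()

# Rows whose state repeats an earlier row, ignoring the timestamp.
state_cols = toy.columns.drop("timestamp").tolist()
repeated_states = toy.duplicated(subset=state_cols).sum()

print(exact_dupes, repeated_states)
```

The two counts answer different questions: the first finds records logged twice, the second finds stretches where nothing changed.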
Conclusion
Questions to answer:
Missing values:
- How many missing values are there in each column?
  - fivegs_smffunction_sm_pdusessioncreationreq_value 1
  - fivegs_smffunction_sm_pdusessioncreationsucc_value 1
  - fivegs_smffunction_sm_qos_flow_nbr_value 19
  - fivegs_smffunction_sm_sessionnbr_value 1
- What percentage of each column is null?
  - fivegs_smffunction_sm_pdusessioncreationreq_value 0.002283
  - fivegs_smffunction_sm_pdusessioncreationsucc_value 0.002283
  - fivegs_smffunction_sm_qos_flow_nbr_value 0.043369
  - fivegs_smffunction_sm_sessionnbr_value 0.002283
- How should the missing values be handled?
  - Remove the rows with null values, or replace them with the column mean, median, or mode?
  - We use the mode, because the data are categorical.
  - The fillna() method can replace the null values with the column mode.
  - After handling the null values, the dataset should be checked again for any remaining nulls.
Data types:
- What is the data type of each column?
  - timestamp object
  - application object
  - log_type object
  - current_uc object
  - All other columns are float64
- How should the data types be converted?
  - Use astype() to convert the columns to the correct data types.
Duplicates:
- Which columns in the dataset are duplicate?
  - Several columns contain only a single value; these columns can be removed.
- How many duplicate rows are there in the dataset?
  - 9739
- How should the duplicates be removed?
  - The data are time-based, so duplicates mean that the state did not change.
Data preparation
In [ ]:
# Remove the columns with only one unique value
data = data.loc[:, data.nunique() > 1]
In [ ]:
# Missing values imputation
data = data.fillna(data.mode().iloc[0])
# Check for missing values again
data.isnull().sum()[data.isnull().sum() > 0]
Out[ ]:
Series([], dtype: int64)
Data: Columns with a single unique value were removed, since they carry no information for the analysis. Missing values were replaced with the column mode.
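The two preparation steps above can be traced on a small example. This is a sketch on a made-up frame (the column names are placeholders), showing what nunique() filtering and mode imputation each do.

```python
import numpy as np
import pandas as pd

# Toy frame with one constant column and a few gaps (values are made up).
toy = pd.DataFrame({
    "constant": [0.0, 0.0, 0.0, 0.0],
    "sessions": [4.0, 4.0, np.nan, 3.0],
    "flows": [1.0, np.nan, 2.0, 2.0],
})

# Keep only columns with more than one distinct value (NaN is not counted).
toy = toy.loc[:, toy.nunique() > 1]

# mode() returns a DataFrame; its first row holds the most frequent value
# of each column (the smallest one if several values tie).
modes = toy.mode().iloc[0]
filled = toy.fillna(modes)

print(filled)
```

Assigning the result of fillna() to a new name avoids modifying a slice of the original frame in place.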
In [ ]:
# Convert timestamp to datetime
data['timestamp'] = pd.to_datetime(data['timestamp'])
# Map string values to numerical values
with open('log_map.json', 'r') as f:
    LOG_MAP = json.load(f)
with open('app_map.json', 'r') as f:
    APP_MAP = json.load(f)
with open('uc_map.json', 'r') as f:
    UC_MAP = json.load(f)
data['application'] = data['application'].map(APP_MAP)
data['log_type'] = data['log_type'].map(LOG_MAP)
data['current_uc'] = data['current_uc'].map(UC_MAP)
# Check if the values were correctly mapped
data.dtypes
Out[ ]:
timestamp datetime64[ns] amf_session_value float64 bearers_active_value float64 fivegs_amffunction_amf_authreq_value float64 fivegs_amffunction_mm_confupdate_value float64 fivegs_amffunction_rm_reginitreq_value float64 fivegs_amffunction_rm_reginitsucc_value float64 fivegs_amffunction_rm_registeredsubnbr_value float64 fivegs_pcffunction_pa_policyamassoreq_value float64 fivegs_pcffunction_pa_policyamassosucc_value float64 fivegs_pcffunction_pa_policysmassoreq_value float64 fivegs_pcffunction_pa_policysmassosucc_value float64 fivegs_pcffunction_pa_sessionnbr_value float64 fivegs_smffunction_sm_pdusessioncreationreq_value float64 fivegs_smffunction_sm_pdusessioncreationsucc_value float64 fivegs_smffunction_sm_qos_flow_nbr_value float64 fivegs_smffunction_sm_sessionnbr_value float64 fivegs_upffunction_sm_n4sessionestabreq_value float64 fivegs_upffunction_upf_qosflows_value float64 fivegs_upffunction_upf_sessionnbr_value float64 process_cpu_seconds_total_value float64 process_open_fds_value float64 process_resident_memory_bytes_value float64 process_start_time_seconds_value float64 process_virtual_memory_bytes_value float64 ran_ue_value float64 application int64 log_type int64 current_uc int64 dtype: object
Data types: All columns were converted to the appropriate data types ('float64', 'int64', 'datetime64[ns]').
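The int64 dtypes in the output indicate that every value was found in its mapping, because Series.map() returns NaN for keys missing from the dict and the column would then become float64. A minimal sketch of making that check explicit follows; the mapping and labels here are hypothetical, not the contents of the JSON files.

```python
import pandas as pd

# Hypothetical mapping; the notebook loads its real mappings from JSON files.
uc_map = {"uc1": 0, "uc2": 1, "uc3": 2}

labels = pd.Series(["uc1", "uc3", "uc9", "uc2"])
mapped = labels.map(uc_map)

# Keys missing from the dict become NaN, which also turns the
# result into float64 instead of int64.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list here would confirm that the mapping covers every label in the column.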
Analysis of individual columns
In [12]:
plt.figure(figsize=(8,5))
sns.countplot(x='current_uc', data=data)
plt.title("Distribution of UC classes (current_uc)")
plt.xlabel("UC class")
plt.ylabel("Number of samples")
plt.show()
UC distribution: Keep in mind that some use cases (UC) occur more often in the dataset than others.
In [ ]:
# Class weights for imbalanced classes
classes = np.unique(data['current_uc'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=data['current_uc'])
class_weights = dict(zip(classes, weights))
print("Class weights:", class_weights)
Class weights: {np.int64(0): np.float64(0.7482749197239872), np.int64(1): np.float64(1.1597310461668784), np.int64(2): np.float64(0.867078335906266), np.int64(3): np.float64(0.6284247066586338), np.int64(4): np.float64(1.475382232100761), np.int64(5): np.float64(2.638838694133237)}
In [ ]:
# Save class weights to JSON
class_weights_serializable = {int(k): float(v) for k, v in class_weights.items()}
with open('class_weights.json', 'w') as f:
    json.dump(class_weights_serializable, f)
UC distribution: This imbalance can be compensated for by weighting the classes with the weights computed above.
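The 'balanced' option of compute_class_weight uses the heuristic n_samples / (n_classes * class_count). With the numbers from this notebook, the most frequent class (uc4, 11619 of 43810 rows, six classes) gives 43810 / (6 × 11619) ≈ 0.628, in line with the smallest weight printed above. The sketch below reproduces the formula with NumPy only, on a made-up label vector.

```python
import numpy as np

# Made-up label vector with three imbalanced classes.
y = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2])

classes, counts = np.unique(y, return_counts=True)

# "balanced" heuristic: n_samples / (n_classes * count of the class)
weights = len(y) / (len(classes) * counts)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weights)
```

Rare classes get weights above 1 and frequent classes get weights below 1, so each class contributes about equally in total.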
In [ ]:
# Select numerical columns for feature selection
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns.tolist()
n_cols = 3
n_rows = math.ceil(len(numerical_cols) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
axes = axes.flatten()
for i, col in enumerate(numerical_cols):
    # Histogram with a kernel density estimate overlaid
    sns.histplot(data[col], kde=True, bins=30, ax=axes[i])
    axes[i].set_title(f'Distribution: {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
# Remove empty axes
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
In [ ]:
# Correlation matrix
corr = data.corr()
plt.figure(figsize=(25, 25))
plt.title("Correlation Matrix")
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={"shrink": .8})
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
Correlation: No metric is strongly correlated with the UC class that we are going to classify.
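Reading one column out of a large annotated heatmap is error-prone, so it can help to extract the correlations with the target and sort them. The sketch below does this on synthetic toy data (the feature names and values are made up); note that Pearson correlation against an integer-encoded class label only captures linear relationships with that particular encoding.

```python
import numpy as np
import pandas as pd

# Toy data: one feature tied to the label, one pure noise (values are made up).
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 3, size=n)
toy = pd.DataFrame({
    "related": target * 2.0 + rng.normal(0, 0.1, size=n),
    "noise": rng.normal(0, 1, size=n),
    "current_uc": target,
})

# Absolute Pearson correlation of every feature with the label, strongest first.
target_corr = (
    toy.corr()["current_uc"]
    .drop("current_uc")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```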
Dataset from real traffic
In [18]:
real_data = pd.read_csv("../real_data.csv")
real_data.head()
Out[18]:
| timestamp | amf_session_value | bearers_active_value | fivegs_amffunction_amf_authreject_value | fivegs_amffunction_amf_authreq_value | fivegs_amffunction_mm_confupdate_value | fivegs_amffunction_mm_confupdatesucc_value | fivegs_amffunction_mm_paging5greq_value | fivegs_amffunction_mm_paging5gsucc_value | fivegs_amffunction_rm_regemergreq_value | ... | process_resident_memory_bytes_value | process_start_time_seconds_value | process_virtual_memory_bytes_value | process_virtual_memory_max_bytes_value | ran_ue_value | s5c_rx_createsession_value | s5c_rx_parse_failed_value | application | log_type | current_uc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-04-10 12:28:14 | 2.0 | 2.0 | 0.0 | 10.0 | 599.0 | 499.0 | 1034.0 | 498.0 | 0.0 | ... | 50106368.0 | 118464611.5 | 1.404078e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc1 |
| 1 | 2025-04-10 12:28:15 | 2.0 | 2.0 | 0.0 | 10.0 | 599.0 | 499.0 | 1034.0 | 498.0 | 0.0 | ... | 50106368.0 | 118464611.5 | 1.404078e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc1 |
| 2 | 2025-04-10 12:28:16 | 2.0 | 2.0 | 0.0 | 10.0 | 599.0 | 499.0 | 1034.0 | 498.0 | 0.0 | ... | 50106368.0 | 118464611.5 | 1.404078e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc1 |
| 3 | 2025-04-10 12:28:17 | 2.0 | 2.0 | 0.0 | 10.0 | 599.0 | 499.0 | 1034.0 | 498.0 | 0.0 | ... | 50106368.0 | 118464611.5 | 1.404078e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc1 |
| 4 | 2025-04-10 12:28:18 | 2.0 | 2.0 | 0.0 | 10.0 | 599.0 | 499.0 | 1034.0 | 498.0 | 0.0 | ... | 50106368.0 | 118464611.5 | 1.404078e+09 | -1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | uc1 |
5 rows × 58 columns
Missing values, data types, duplicates, and descriptive statistics
In [19]:
real_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6785 entries, 0 to 6784 Data columns (total 58 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 timestamp 6785 non-null object 1 amf_session_value 6785 non-null float64 2 bearers_active_value 6785 non-null float64 3 fivegs_amffunction_amf_authreject_value 6785 non-null float64 4 fivegs_amffunction_amf_authreq_value 6785 non-null float64 5 fivegs_amffunction_mm_confupdate_value 6785 non-null float64 6 fivegs_amffunction_mm_confupdatesucc_value 6785 non-null float64 7 fivegs_amffunction_mm_paging5greq_value 6785 non-null float64 8 fivegs_amffunction_mm_paging5gsucc_value 6785 non-null float64 9 fivegs_amffunction_rm_regemergreq_value 6785 non-null float64 10 fivegs_amffunction_rm_regemergsucc_value 6785 non-null float64 11 fivegs_amffunction_rm_reginitreq_value 6785 non-null float64 12 fivegs_amffunction_rm_reginitsucc_value 6785 non-null float64 13 fivegs_amffunction_rm_registeredsubnbr_value 6785 non-null float64 14 fivegs_amffunction_rm_regmobreq_value 6785 non-null float64 15 fivegs_amffunction_rm_regmobsucc_value 6785 non-null float64 16 fivegs_amffunction_rm_regperiodreq_value 6785 non-null float64 17 fivegs_amffunction_rm_regperiodsucc_value 6785 non-null float64 18 fivegs_ep_n3_gtp_indatapktn3upf_value 6785 non-null float64 19 fivegs_ep_n3_gtp_outdatapktn3upf_value 6785 non-null float64 20 fivegs_pcffunction_pa_policyamassoreq_value 6785 non-null float64 21 fivegs_pcffunction_pa_policyamassosucc_value 6785 non-null float64 22 fivegs_pcffunction_pa_policysmassoreq_value 6785 non-null float64 23 fivegs_pcffunction_pa_policysmassosucc_value 6785 non-null float64 24 fivegs_pcffunction_pa_sessionnbr_value 6785 non-null float64 25 fivegs_smffunction_sm_n4sessionestabreq_value 6785 non-null float64 26 fivegs_smffunction_sm_n4sessionreport_value 6785 non-null float64 27 fivegs_smffunction_sm_n4sessionreportsucc_value 6785 non-null float64 28 fivegs_smffunction_sm_pdusessioncreationreq_value 
6785 non-null float64 29 fivegs_smffunction_sm_pdusessioncreationsucc_value 6785 non-null float64 30 fivegs_smffunction_sm_qos_flow_nbr_value 6785 non-null float64 31 fivegs_smffunction_sm_sessionnbr_value 6785 non-null float64 32 fivegs_upffunction_sm_n4sessionestabreq_value 6785 non-null float64 33 fivegs_upffunction_sm_n4sessionreport_value 6785 non-null float64 34 fivegs_upffunction_sm_n4sessionreportsucc_value 6785 non-null float64 35 fivegs_upffunction_upf_qosflows_value 6785 non-null float64 36 fivegs_upffunction_upf_sessionnbr_value 6785 non-null float64 37 gn_rx_createpdpcontextreq_value 6785 non-null float64 38 gn_rx_deletepdpcontextreq_value 6785 non-null float64 39 gn_rx_parse_failed_value 6785 non-null float64 40 gnb_value 6785 non-null float64 41 gtp1_pdpctxs_active_value 6785 non-null float64 42 gtp2_sessions_active_value 6785 non-null float64 43 gtp_new_node_failed_value 6785 non-null float64 44 gtp_peers_active_value 6785 non-null float64 45 process_cpu_seconds_total_value 6785 non-null float64 46 process_max_fds_value 6785 non-null float64 47 process_open_fds_value 6785 non-null float64 48 process_resident_memory_bytes_value 6785 non-null float64 49 process_start_time_seconds_value 6785 non-null float64 50 process_virtual_memory_bytes_value 6785 non-null float64 51 process_virtual_memory_max_bytes_value 6785 non-null float64 52 ran_ue_value 6785 non-null float64 53 s5c_rx_createsession_value 6785 non-null float64 54 s5c_rx_parse_failed_value 6785 non-null float64 55 application 6785 non-null object 56 log_type 6785 non-null object 57 current_uc 6785 non-null object dtypes: float64(54), object(4) memory usage: 3.0+ MB
In [20]:
real_data.describe(include='all')
Out[20]:
| timestamp | amf_session_value | bearers_active_value | fivegs_amffunction_amf_authreject_value | fivegs_amffunction_amf_authreq_value | fivegs_amffunction_mm_confupdate_value | fivegs_amffunction_mm_confupdatesucc_value | fivegs_amffunction_mm_paging5greq_value | fivegs_amffunction_mm_paging5gsucc_value | fivegs_amffunction_rm_regemergreq_value | ... | process_resident_memory_bytes_value | process_start_time_seconds_value | process_virtual_memory_bytes_value | process_virtual_memory_max_bytes_value | ran_ue_value | s5c_rx_createsession_value | s5c_rx_parse_failed_value | application | log_type | current_uc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6785 | 6785.000000 | 6785.000000 | 6785.0 | 6785.000000 | 6785.000000 | 6785.000000 | 6785.000000 | 6785.000000 | 6785.0 | ... | 6.785000e+03 | 6.785000e+03 | 6.785000e+03 | 6785.000000 | 6785.000000 | 6785.0 | 6785.0 | 6785 | 6785 | 6785 |
| unique | 5084 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 6 | 6 |
| top | 2025-04-10 13:20:34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | uc1 |
| freq | 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4572 | 4572 | 3145 |
| mean | NaN | 2.791452 | 2.766839 | 0.0 | 12.206338 | 734.754901 | 531.113780 | 1096.827119 | 529.088578 | 0.0 | ... | 5.114389e+07 | 1.138908e+08 | 1.350298e+09 | -0.960796 | 1.844510 | 0.0 | 0.0 | NaN | NaN | NaN |
| std | NaN | 1.494975 | 1.526840 | 0.0 | 2.765957 | 282.378079 | 125.164509 | 256.344911 | 124.266089 | 0.0 | ... | 1.062879e+07 | 2.303680e+07 | 2.735730e+08 | 0.194095 | 1.404222 | 0.0 | 0.0 | NaN | NaN | NaN |
| min | NaN | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -1.000000 | 0.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| 25% | NaN | 1.000000 | 1.000000 | 0.0 | 12.000000 | 618.000000 | 506.000000 | 1048.000000 | 505.000000 | 0.0 | ... | 5.163725e+07 | 1.184646e+08 | 1.404078e+09 | -1.000000 | 1.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| 50% | NaN | 4.000000 | 4.000000 | 0.0 | 12.000000 | 633.000000 | 521.000000 | 1078.000000 | 519.000000 | 0.0 | ... | 5.210112e+07 | 1.184646e+08 | 1.404078e+09 | -1.000000 | 2.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| 75% | NaN | 4.000000 | 4.000000 | 0.0 | 12.000000 | 640.000000 | 527.000000 | 1090.000000 | 525.000000 | 0.0 | ... | 5.259264e+07 | 1.184646e+08 | 1.404078e+09 | -1.000000 | 3.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
| max | NaN | 4.000000 | 4.000000 | 0.0 | 15.000000 | 1211.000000 | 671.000000 | 1378.000000 | 667.000000 | 0.0 | ... | 6.247902e+07 | 1.376044e+08 | 1.747683e+09 | 0.000000 | 5.000000 | 0.0 | 0.0 | NaN | NaN | NaN |
11 rows × 58 columns
In [21]:
real_data.isnull().sum()[real_data.isnull().sum() > 0]
Out[21]:
Series([], dtype: int64)
In [22]:
real_data.nunique()[real_data.nunique() > 1].apply(lambda x: f"{x:<50}{real_data.nunique()[real_data.nunique() > 1].index[real_data.nunique()[real_data.nunique() > 1] == x][0]}")
Out[22]:
timestamp 5084 ... amf_session_value 5 ... bearers_active_value 5 ... fivegs_amffunction_amf_authreq_value 4 ... fivegs_amffunction_mm_confupdate_value 31 ... fivegs_amffunction_mm_confupdatesucc_value 27 ... fivegs_amffunction_mm_paging5greq_value 27 ... fivegs_amffunction_mm_paging5gsucc_value 26 ... fivegs_amffunction_rm_reginitreq_value 9 ... fivegs_amffunction_rm_reginitsucc_value 5 ... fivegs_amffunction_rm_registeredsubnbr_value 5 ... fivegs_amffunction_rm_regmobreq_value 4 ... fivegs_amffunction_rm_regmobsucc_value 4 ... fivegs_amffunction_rm_regperiodreq_value 7 ... fivegs_amffunction_rm_regperiodsucc_value 7 ... fivegs_pcffunction_pa_policyamassoreq_value 5 ... fivegs_pcffunction_pa_policyamassosucc_value 5 ... fivegs_pcffunction_pa_policysmassoreq_value 80 ... fivegs_pcffunction_pa_policysmassosucc_value 80 ... fivegs_pcffunction_pa_sessionnbr_value 7 ... fivegs_smffunction_sm_n4sessionestabreq_value 2 ... fivegs_smffunction_sm_n4sessionreport_value 54 ... fivegs_smffunction_sm_n4sessionreportsucc_value 54 ... fivegs_smffunction_sm_pdusessioncreationreq_value 75 ... fivegs_smffunction_sm_pdusessioncreationsucc_value 75 ... fivegs_smffunction_sm_qos_flow_nbr_value 75 ... fivegs_smffunction_sm_sessionnbr_value 5 ... fivegs_upffunction_sm_n4sessionestabreq_value 79 ... fivegs_upffunction_sm_n4sessionreport_value 55 ... fivegs_upffunction_sm_n4sessionreportsucc_value 54 ... fivegs_upffunction_upf_qosflows_value 9 ... fivegs_upffunction_upf_sessionnbr_value 7 ... gn_rx_createpdpcontextreq_value 3 ... gn_rx_deletepdpcontextreq_value 3 ... gn_rx_parse_failed_value 3 ... gnb_value 3 ... gtp1_pdpctxs_active_value 2 ... process_cpu_seconds_total_value 106 ... process_max_fds_value 2 ... process_open_fds_value 6 ... process_resident_memory_bytes_value 371 ... process_start_time_seconds_value 3 ... process_virtual_memory_bytes_value 3 ... process_virtual_memory_max_bytes_value 2 ... ran_ue_value 6 ... application 4 ... log_type 6 ... current_uc 6 ... dtype: object
In [23]:
duplicates = real_data.duplicated()
duplicates_sum = duplicates.sum()
print(f"Total duplicates: {duplicates_sum}")
Total duplicates: 0
Data: The real data contain no missing values and no duplicate rows.
Conclusion
Questions to answer:
Missing values:
- How many missing values are there in each column?
  - No missing values
Data types:
- What is the data type of each column?
  - timestamp object
  - application object
  - log_type object
  - current_uc object
  - All other columns are float64
- How should the data types be converted?
  - Use astype() to convert the columns to the correct data types.
Duplicates:
- Which columns in the dataset are duplicate?
  - There are no duplicate columns in the dataset.
- How many duplicate rows are there in the dataset?
  - 0
Data preparation
In [ ]:
# Remove columns with only one unique value
real_data = real_data.loc[:, real_data.nunique() > 1]
In [ ]:
# Missing values imputation
real_data = real_data.fillna(real_data.mode().iloc[0])
# Check for missing values again
real_data.isnull().sum()[real_data.isnull().sum() > 0]
Out[ ]:
Series([], dtype: int64)
In [ ]:
# Convert timestamp to datetime
real_data['timestamp'] = pd.to_datetime(real_data['timestamp'])
real_data['application'] = real_data['application'].map(APP_MAP)
real_data['log_type'] = real_data['log_type'].map(LOG_MAP)
real_data['current_uc'] = real_data['current_uc'].map(UC_MAP)
# Check if we mapped the values correctly
real_data.dtypes
Out[ ]:
timestamp    datetime64[ns]
amf_session_value    float64
bearers_active_value    float64
fivegs_amffunction_amf_authreq_value    float64
fivegs_amffunction_mm_confupdate_value    float64
fivegs_amffunction_mm_confupdatesucc_value    float64
fivegs_amffunction_mm_paging5greq_value    float64
fivegs_amffunction_mm_paging5gsucc_value    float64
fivegs_amffunction_rm_reginitreq_value    float64
fivegs_amffunction_rm_reginitsucc_value    float64
fivegs_amffunction_rm_registeredsubnbr_value    float64
fivegs_amffunction_rm_regmobreq_value    float64
fivegs_amffunction_rm_regmobsucc_value    float64
fivegs_amffunction_rm_regperiodreq_value    float64
fivegs_amffunction_rm_regperiodsucc_value    float64
fivegs_pcffunction_pa_policyamassoreq_value    float64
fivegs_pcffunction_pa_policyamassosucc_value    float64
fivegs_pcffunction_pa_policysmassoreq_value    float64
fivegs_pcffunction_pa_policysmassosucc_value    float64
fivegs_pcffunction_pa_sessionnbr_value    float64
fivegs_smffunction_sm_n4sessionestabreq_value    float64
fivegs_smffunction_sm_n4sessionreport_value    float64
fivegs_smffunction_sm_n4sessionreportsucc_value    float64
fivegs_smffunction_sm_pdusessioncreationreq_value    float64
fivegs_smffunction_sm_pdusessioncreationsucc_value    float64
fivegs_smffunction_sm_qos_flow_nbr_value    float64
fivegs_smffunction_sm_sessionnbr_value    float64
fivegs_upffunction_sm_n4sessionestabreq_value    float64
fivegs_upffunction_sm_n4sessionreport_value    float64
fivegs_upffunction_sm_n4sessionreportsucc_value    float64
fivegs_upffunction_upf_qosflows_value    float64
fivegs_upffunction_upf_sessionnbr_value    float64
gn_rx_createpdpcontextreq_value    float64
gn_rx_deletepdpcontextreq_value    float64
gn_rx_parse_failed_value    float64
gnb_value    float64
gtp1_pdpctxs_active_value    float64
process_cpu_seconds_total_value    float64
process_max_fds_value    float64
process_open_fds_value    float64
process_resident_memory_bytes_value    float64
process_start_time_seconds_value    float64
process_virtual_memory_bytes_value    float64
process_virtual_memory_max_bytes_value    float64
ran_ue_value    float64
application    int64
log_type    int64
current_uc    int64
dtype: object
Data types: All columns were converted to appropriate data types ('float64', 'int64', 'datetime64[ns]').
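A minimal sketch of how such a conversion might look. The two-row frame and its values are made up for illustration; only the column names are taken from the dataset, and the actual conversion steps used in this notebook may differ.

```python
import pandas as pd

# Hypothetical sample: column names come from the dataset, values are invented
raw = pd.DataFrame({
    "timestamp": ["2025-04-11 14:41:57", "2025-04-11 14:41:58"],
    "amf_session_value": ["4.0", "4.0"],
    "application": ["0", "1"],
})

converted = raw.copy()
# Parse the timestamp strings into a datetime column
converted["timestamp"] = pd.to_datetime(converted["timestamp"])
# Metric columns become floats; unparsable entries turn into NaN instead of raising
converted["amf_session_value"] = pd.to_numeric(converted["amf_session_value"], errors="coerce")
# Encoded categorical columns become 64-bit integers
converted["application"] = pd.to_numeric(converted["application"]).astype("int64")

print(converted.dtypes)
```

Using `errors="coerce"` is a design choice: it keeps the conversion from failing on a single malformed value, at the cost of introducing NaN values that have to be handled afterwards.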
Data visualization ¶
In [ ]:
# Select only numerical columns for plotting
numerical_cols = real_data.select_dtypes(include=['float64', 'int64']).columns.tolist()
n_cols = 3
n_rows = math.ceil(len(numerical_cols) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
axes = axes.flatten()
for i, col in enumerate(numerical_cols):
    # Histogram with a kernel density estimate overlay
    sns.histplot(real_data[col], kde=True, bins=30, ax=axes[i])
    axes[i].set_title(f'Distribution: {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
# Remove the unused axes at the end of the grid
for j in range(len(numerical_cols), len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
In [ ]:
# Correlation matrix
corr = real_data.corr(numeric_only=True)  # skip the non-numeric timestamp column
plt.figure(figsize=(25, 25))
plt.title("Correlation Matrix")
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={"shrink": .8})
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
Correlation: Unlike the synthetic data, the real data shows a strong correlation between some metrics and the UC class.
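To read such a relationship off the matrix without scanning the full heatmap, the correlations with the target column can be extracted and ranked. The sketch below uses made-up toy data (`metric_related` and `metric_noise` are hypothetical names); note that Pearson correlation with an integer-encoded class label is only a rough indicator, since the label is nominal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Hypothetical toy data: one metric driven by the class label, one pure-noise metric
uc = rng.integers(0, 3, size=n)
toy = pd.DataFrame({
    "metric_related": uc * 2.0 + rng.normal(0, 0.5, size=n),
    "metric_noise": rng.normal(0, 1, size=n),
    "current_uc": uc,
})

# Absolute correlation of every numeric column with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["current_uc"]
    .drop("current_uc")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```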
Metric selection ¶
In [30]:
data = pd.read_csv("../synthetic_data.csv")
real_data = pd.read_csv("../real_data.csv")
In [ ]:
def preprocess_data(data):
    """
    Preprocess the input DataFrame: handle missing values, map categorical variables,
    select numerical features, and normalize the dataset.

    Note: the input DataFrame is modified in place.
    """
    # Fill missing values with the most frequent value of each column
    data.fillna(data.mode().iloc[0], inplace=True)
    data['application'] = data['application'].map(APP_MAP)
    data['log_type'] = data['log_type'].map(LOG_MAP)
    data['current_uc'] = data['current_uc'].map(UC_MAP)
    # Numerical columns
    X = data.drop(columns=['timestamp', 'current_uc'], errors='ignore')
    X = X.select_dtypes(include=[np.number])
    y = data['current_uc'].astype(int)
    # Data scaling
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled, X, y
In [ ]:
def base_estimator(X_scaled, X, y):
    """Fit a Random Forest model and calculate feature importances."""
    rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    # Fit the model
    rf.fit(X_scaled, y)
    # Impurity-based importances, labeled with the column names and sorted descending
    rf_importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
    return rf_importances, rf
In [ ]:
def RFE_def(X_scaled, y, X, rf):
    """Apply Recursive Feature Elimination (RFE) for feature selection."""
    rfe = RFE(estimator=rf, n_features_to_select=10)
    rfe.fit(X_scaled, y)
    # Boolean mask of the kept features, labeled with the column names
    rfe_selected = pd.Series(rfe.support_, index=X.columns)
    return rfe_selected
In [ ]:
def RFECV_def(X_scaled, y, X, rf):
    """Apply Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection."""
    rfecv = RFECV(estimator=rf, step=1, cv=StratifiedKFold(5), scoring='f1_weighted', n_jobs=-1)
    rfecv.fit(X_scaled, y)
    rfecv_selected = pd.Series(rfecv.support_, index=X.columns)
    return rfecv_selected
In [ ]:
def SFSX_def(X_scaled, y, X, rf):
    """Apply forward Sequential Feature Selection (SFS) for feature selection."""
    sfs = SequentialFeatureSelector(rf, n_features_to_select=10, direction='forward', scoring='f1_weighted', cv=5, n_jobs=-1)
    sfs.fit(X_scaled, y)
    sfs_selected = pd.Series(sfs.get_support(), index=X.columns)
    return sfs_selected
In [ ]:
X_scaled, X, y = preprocess_data(data)
rf_importances, rf = base_estimator(X_scaled, X, y)
rfe_selected = RFE_def(X_scaled, y, X, rf)
rfecv_selected = RFECV_def(X_scaled, y, X, rf)
sfs_selected = SFSX_def(X_scaled, y, X, rf)
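summary_df = pd.DataFrame({
    # 'Feature' is filled by position from X.columns, whereas the Series below
    # are aligned on their index labels (the feature names)
    'Feature': X.columns,
    'RF_Importance': rf_importances,
    'Selected_RFE': rfe_selected,
    'Selected_RFECV': rfecv_selected,
    'Selected_SFS': sfs_selected
}).sort_values(by='RF_Importance', ascending=False)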
summary_df
Out[ ]:
| | Feature | RF_Importance | Selected_RFE | Selected_RFECV | Selected_SFS |
|---|---|---|---|---|---|
| process_cpu_seconds_total_value | process_open_fds_value | 0.140548 | True | True | False |
| ran_ue_value | s5c_rx_parse_failed_value | 0.136606 | True | True | True |
| fivegs_amffunction_rm_reginitreq_value | fivegs_amffunction_rm_reginitsucc_value | 0.091741 | True | True | False |
| process_resident_memory_bytes_value | process_virtual_memory_bytes_value | 0.065022 | True | True | False |
| fivegs_amffunction_amf_authreq_value | fivegs_amffunction_mm_confupdate_value | 0.048014 | True | True | False |
| fivegs_upffunction_sm_n4sessionestabreq_value | fivegs_upffunction_sm_n4sessionreport_value | 0.046800 | True | True | False |
| fivegs_amffunction_mm_confupdate_value | fivegs_amffunction_mm_confupdatesucc_value | 0.044649 | True | True | False |
| fivegs_smffunction_sm_pdusessioncreationsucc_value | fivegs_smffunction_sm_qos_flow_nbr_value | 0.039789 | False | True | False |
| fivegs_pcffunction_pa_policyamassoreq_value | fivegs_pcffunction_pa_policyamassosucc_value | 0.039624 | True | True | False |
| fivegs_amffunction_rm_reginitsucc_value | fivegs_amffunction_rm_registeredsubnbr_value | 0.039166 | True | True | False |
| fivegs_smffunction_sm_pdusessioncreationreq_value | fivegs_smffunction_sm_pdusessioncreationsucc_v... | 0.038534 | False | True | True |
| fivegs_pcffunction_pa_policyamassosucc_value | fivegs_pcffunction_pa_policysmassoreq_value | 0.038439 | True | True | False |
| process_start_time_seconds_value | process_virtual_memory_max_bytes_value | 0.037987 | False | True | False |
| fivegs_pcffunction_pa_policysmassoreq_value | fivegs_pcffunction_pa_policysmassosucc_value | 0.035636 | False | True | False |
| fivegs_pcffunction_pa_policysmassosucc_value | fivegs_pcffunction_pa_sessionnbr_value | 0.034669 | False | True | False |
| fivegs_smffunction_sm_qos_flow_nbr_value | fivegs_smffunction_sm_sessionnbr_value | 0.034136 | False | True | False |
| log_type | process_max_fds_value | 0.022302 | False | True | False |
| application | bearers_active_value | 0.021549 | False | True | True |
| process_open_fds_value | process_start_time_seconds_value | 0.010636 | False | True | True |
| process_virtual_memory_bytes_value | ran_ue_value | 0.009913 | False | True | True |
| fivegs_pcffunction_pa_sessionnbr_value | fivegs_smffunction_sm_n4sessionestabreq_value | 0.004060 | False | True | True |
| bearers_active_value | fivegs_amffunction_amf_authreject_value | 0.003886 | False | True | True |
| fivegs_upffunction_upf_qosflows_value | fivegs_upffunction_upf_sessionnbr_value | 0.003819 | False | True | True |
| fivegs_smffunction_sm_sessionnbr_value | fivegs_upffunction_sm_n4sessionestabreq_value | 0.003653 | False | True | False |
| fivegs_upffunction_upf_sessionnbr_value | gn_rx_createpdpcontextreq_value | 0.003404 | False | True | False |
| amf_session_value | amf_session_value | 0.002849 | False | True | True |
| fivegs_amffunction_rm_registeredsubnbr_value | fivegs_amffunction_rm_regmobreq_value | 0.002568 | False | True | False |
| fivegs_smffunction_sm_n4sessionreport_value | fivegs_smffunction_sm_n4sessionreportsucc_value | 0.000000 | False | True | False |
| fivegs_smffunction_sm_n4sessionreportsucc_value | fivegs_smffunction_sm_pdusessioncreationreq_value | 0.000000 | False | True | False |
| s5c_rx_createsession_value | application | 0.000000 | False | True | True |
| fivegs_amffunction_amf_authreject_value | fivegs_amffunction_amf_authreq_value | 0.000000 | False | True | False |
| process_virtual_memory_max_bytes_value | s5c_rx_createsession_value | 0.000000 | False | True | False |
| fivegs_amffunction_mm_confupdatesucc_value | fivegs_amffunction_mm_paging5greq_value | 0.000000 | False | True | False |
| fivegs_amffunction_mm_paging5greq_value | fivegs_amffunction_mm_paging5gsucc_value | 0.000000 | False | True | False |
| fivegs_amffunction_mm_paging5gsucc_value | fivegs_amffunction_rm_regemergreq_value | 0.000000 | False | True | False |
| fivegs_amffunction_rm_regemergreq_value | fivegs_amffunction_rm_regemergsucc_value | 0.000000 | False | True | False |
| process_max_fds_value | process_resident_memory_bytes_value | 0.000000 | False | True | False |
| fivegs_amffunction_rm_regemergsucc_value | fivegs_amffunction_rm_reginitreq_value | 0.000000 | False | True | False |
| gtp_peers_active_value | process_cpu_seconds_total_value | 0.000000 | False | True | False |
| gtp_new_node_failed_value | gtp_peers_active_value | 0.000000 | False | True | False |
| gtp2_sessions_active_value | gtp_new_node_failed_value | 0.000000 | False | True | False |
| gtp1_pdpctxs_active_value | gtp2_sessions_active_value | 0.000000 | False | True | False |
| gnb_value | gtp1_pdpctxs_active_value | 0.000000 | False | False | False |
| gn_rx_parse_failed_value | gnb_value | 0.000000 | False | False | False |
| gn_rx_deletepdpcontextreq_value | gn_rx_parse_failed_value | 0.000000 | False | False | False |
| gn_rx_createpdpcontextreq_value | gn_rx_deletepdpcontextreq_value | 0.000000 | False | False | False |
| fivegs_amffunction_rm_regmobreq_value | fivegs_amffunction_rm_regmobsucc_value | 0.000000 | False | False | False |
| fivegs_amffunction_rm_regmobsucc_value | fivegs_amffunction_rm_regperiodreq_value | 0.000000 | False | False | False |
| fivegs_upffunction_sm_n4sessionreportsucc_value | fivegs_upffunction_upf_qosflows_value | 0.000000 | False | False | False |
| fivegs_upffunction_sm_n4sessionreport_value | fivegs_upffunction_sm_n4sessionreportsucc_value | 0.000000 | False | False | False |
| fivegs_amffunction_rm_regperiodreq_value | fivegs_amffunction_rm_regperiodsucc_value | 0.000000 | False | False | False |
| fivegs_amffunction_rm_regperiodsucc_value | fivegs_ep_n3_gtp_indatapktn3upf_value | 0.000000 | False | False | False |
| fivegs_ep_n3_gtp_indatapktn3upf_value | fivegs_ep_n3_gtp_outdatapktn3upf_value | 0.000000 | False | False | False |
| fivegs_ep_n3_gtp_outdatapktn3upf_value | fivegs_pcffunction_pa_policyamassoreq_value | 0.000000 | False | False | False |
| fivegs_smffunction_sm_n4sessionestabreq_value | fivegs_smffunction_sm_n4sessionreport_value | 0.000000 | False | True | False |
| s5c_rx_parse_failed_value | log_type | 0.000000 | False | True | False |
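In the table above, the unlabeled leftmost column is the DataFrame index, that is, the feature each row's scores were computed for, while the `Feature` column was filled positionally from `X.columns` and follows a different order. This happens because pandas aligns Series on their index labels but places plain arrays and `Index` objects by position. The sketch below reproduces the effect with made-up names and scores (`metric_a`, `metric_b`, `metric_c` are hypothetical) and shows one possible way to keep names and scores on the same row; it is an illustration, not the code that produced the table.

```python
import pandas as pd

# Hypothetical feature names and scores, not taken from the notebook
columns = pd.Index(["metric_c", "metric_a", "metric_b"])
importances = pd.Series([0.1, 0.7, 0.2], index=columns).sort_values(ascending=False)
selected = pd.Series([False, True, True], index=columns)

# Mixing a positional Index with label-aligned Series: the row index is built from
# the Series labels, while 'Feature' keeps the original column order
misaligned = pd.DataFrame({
    "Feature": columns,
    "RF_Importance": importances,
    "Selected": selected,
})

# Building from Series only and turning the index into the 'Feature' column
# keeps every name on the same row as its own scores
aligned = (
    pd.DataFrame({"RF_Importance": importances, "Selected": selected})
    .rename_axis("Feature")
    .reset_index()
    .sort_values(by="RF_Importance", ascending=False, ignore_index=True)
)
print(misaligned)
print(aligned)
```

When reading the tables in this section, the index column is therefore the one that identifies which feature a score belongs to.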
In [37]:
X_scaled_real, X_real, y_real = preprocess_data(real_data)
rf_importances_real, rf_real = base_estimator(X_scaled_real, X_real, y_real)
rfe_selected_real = RFE_def(X_scaled_real, y_real, X_real, rf_real)
rfecv_selected_real = RFECV_def(X_scaled_real, y_real, X_real, rf_real)
sfs_selected_real = SFSX_def(X_scaled_real, y_real, X_real, rf_real)
summary_real_df = pd.DataFrame({
    'Feature': X_real.columns,
    'RF_Importance': rf_importances_real,
    'Selected_RFE': rfe_selected_real,
    'Selected_RFECV': rfecv_selected_real,
    'Selected_SFS': sfs_selected_real
}).sort_values(by='RF_Importance', ascending=False)
summary_real_df
Out[37]:
| | Feature | RF_Importance | Selected_RFE | Selected_RFECV | Selected_SFS |
|---|---|---|---|---|---|
| process_resident_memory_bytes_value | process_virtual_memory_bytes_value | 0.126318 | True | True | False |
| process_cpu_seconds_total_value | process_open_fds_value | 0.062450 | False | True | False |
| fivegs_smffunction_sm_qos_flow_nbr_value | fivegs_smffunction_sm_sessionnbr_value | 0.057770 | True | True | False |
| fivegs_smffunction_sm_pdusessioncreationsucc_value | fivegs_smffunction_sm_qos_flow_nbr_value | 0.051967 | True | True | False |
| fivegs_pcffunction_pa_policysmassoreq_value | fivegs_pcffunction_pa_policysmassosucc_value | 0.049094 | True | True | False |
| fivegs_smffunction_sm_n4sessionreport_value | fivegs_smffunction_sm_n4sessionreportsucc_value | 0.049090 | True | True | True |
| fivegs_pcffunction_pa_policysmassosucc_value | fivegs_pcffunction_pa_sessionnbr_value | 0.048142 | True | True | False |
| fivegs_amffunction_mm_paging5gsucc_value | fivegs_amffunction_rm_regemergreq_value | 0.047285 | False | True | False |
| fivegs_smffunction_sm_n4sessionreportsucc_value | fivegs_smffunction_sm_pdusessioncreationreq_value | 0.044607 | True | True | True |
| fivegs_smffunction_sm_pdusessioncreationreq_value | fivegs_smffunction_sm_pdusessioncreationsucc_v... | 0.041712 | False | True | False |
| fivegs_upffunction_sm_n4sessionestabreq_value | fivegs_upffunction_sm_n4sessionreport_value | 0.039699 | True | True | False |
| fivegs_upffunction_sm_n4sessionreportsucc_value | fivegs_upffunction_upf_qosflows_value | 0.032001 | False | True | False |
| fivegs_pcffunction_pa_sessionnbr_value | fivegs_smffunction_sm_n4sessionestabreq_value | 0.030837 | True | True | False |
| fivegs_upffunction_sm_n4sessionreport_value | fivegs_upffunction_sm_n4sessionreportsucc_value | 0.030378 | True | True | False |
| bearers_active_value | fivegs_amffunction_amf_authreject_value | 0.029796 | False | True | True |
| fivegs_upffunction_upf_sessionnbr_value | gn_rx_createpdpcontextreq_value | 0.025697 | False | True | False |
| fivegs_amffunction_mm_confupdatesucc_value | fivegs_amffunction_mm_paging5greq_value | 0.025319 | False | True | False |
| fivegs_amffunction_mm_confupdate_value | fivegs_amffunction_mm_confupdatesucc_value | 0.025103 | False | True | False |
| fivegs_amffunction_mm_paging5greq_value | fivegs_amffunction_mm_paging5gsucc_value | 0.024645 | False | True | False |
| fivegs_upffunction_upf_qosflows_value | fivegs_upffunction_upf_sessionnbr_value | 0.021230 | False | True | False |
| fivegs_amffunction_rm_reginitsucc_value | fivegs_amffunction_rm_registeredsubnbr_value | 0.020025 | False | False | False |
| fivegs_smffunction_sm_sessionnbr_value | fivegs_upffunction_sm_n4sessionestabreq_value | 0.015973 | False | False | False |
| fivegs_pcffunction_pa_policyamassosucc_value | fivegs_pcffunction_pa_policysmassoreq_value | 0.014403 | False | False | False |
| amf_session_value | amf_session_value | 0.013929 | False | True | True |
| fivegs_pcffunction_pa_policyamassoreq_value | fivegs_pcffunction_pa_policyamassosucc_value | 0.013178 | False | False | False |
| fivegs_amffunction_rm_reginitreq_value | fivegs_amffunction_rm_reginitsucc_value | 0.011465 | False | False | False |
| ran_ue_value | s5c_rx_parse_failed_value | 0.008889 | False | False | False |
| fivegs_amffunction_rm_regmobreq_value | fivegs_amffunction_rm_regmobsucc_value | 0.004324 | False | False | False |
| fivegs_amffunction_amf_authreq_value | fivegs_amffunction_mm_confupdate_value | 0.003612 | False | False | True |
| gn_rx_createpdpcontextreq_value | gn_rx_deletepdpcontextreq_value | 0.003265 | False | False | False |
| fivegs_amffunction_rm_registeredsubnbr_value | fivegs_amffunction_rm_regmobreq_value | 0.002844 | False | False | True |
| fivegs_amffunction_rm_regperiodsucc_value | fivegs_ep_n3_gtp_indatapktn3upf_value | 0.002757 | False | False | False |
| gn_rx_parse_failed_value | gnb_value | 0.002478 | False | False | False |
| application | bearers_active_value | 0.002411 | False | False | True |
| fivegs_amffunction_rm_regperiodreq_value | fivegs_amffunction_rm_regperiodsucc_value | 0.002292 | False | False | False |
| log_type | process_max_fds_value | 0.002032 | False | False | False |
| process_max_fds_value | process_resident_memory_bytes_value | 0.002027 | False | False | False |
| process_virtual_memory_bytes_value | ran_ue_value | 0.001968 | False | False | False |
| fivegs_amffunction_rm_regmobsucc_value | fivegs_amffunction_rm_regperiodreq_value | 0.001955 | False | False | False |
| process_open_fds_value | process_start_time_seconds_value | 0.001728 | False | False | False |
| gn_rx_deletepdpcontextreq_value | gn_rx_parse_failed_value | 0.001703 | False | False | False |
| gtp1_pdpctxs_active_value | gtp2_sessions_active_value | 0.001265 | False | False | False |
| process_start_time_seconds_value | process_virtual_memory_max_bytes_value | 0.001047 | False | False | True |
| gnb_value | gtp1_pdpctxs_active_value | 0.000983 | False | False | False |
| process_virtual_memory_max_bytes_value | s5c_rx_createsession_value | 0.000209 | False | False | False |
| fivegs_smffunction_sm_n4sessionestabreq_value | fivegs_smffunction_sm_n4sessionreport_value | 0.000100 | False | False | False |
| gtp_peers_active_value | process_cpu_seconds_total_value | 0.000000 | False | False | False |
| gtp_new_node_failed_value | gtp_peers_active_value | 0.000000 | False | False | False |
| gtp2_sessions_active_value | gtp_new_node_failed_value | 0.000000 | False | False | False |
| fivegs_amffunction_rm_regemergreq_value | fivegs_amffunction_rm_regemergsucc_value | 0.000000 | False | False | False |
| fivegs_amffunction_amf_authreject_value | fivegs_amffunction_amf_authreq_value | 0.000000 | False | False | True |
| fivegs_amffunction_rm_regemergsucc_value | fivegs_amffunction_rm_reginitreq_value | 0.000000 | False | False | False |
| fivegs_ep_n3_gtp_indatapktn3upf_value | fivegs_ep_n3_gtp_outdatapktn3upf_value | 0.000000 | False | False | False |
| fivegs_ep_n3_gtp_outdatapktn3upf_value | fivegs_pcffunction_pa_policyamassoreq_value | 0.000000 | False | False | False |
| s5c_rx_createsession_value | application | 0.000000 | False | False | True |
| s5c_rx_parse_failed_value | log_type | 0.000000 | False | False | False |
In [ ]:
# Load the RF importances for synthetic and real data
rf_synth = summary_df[['Feature', 'RF_Importance']]
rf_real = summary_real_df[['Feature', 'RF_Importance']]
# Rename columns for synthetic and real data
rf_synth = rf_synth.rename(columns={'RF_Importance': 'Importance_Synthetic'})
rf_real = rf_real.rename(columns={'RF_Importance': 'Importance_Real'})
# Merge the two DataFrames on 'Feature'
merged = pd.merge(rf_synth, rf_real, on='Feature', how='inner')
# Select the top N features based on the maximum importance from both datasets
top_n = 25
merged['Combined'] = merged[['Importance_Synthetic', 'Importance_Real']].max(axis=1)
top_features = merged.sort_values(by='Combined', ascending=False).head(top_n)
# Graphical representation of the feature importances
plt.figure(figsize=(12, 6))
bar_width = 0.4
indices = range(len(top_features))
plt.barh([i + bar_width for i in indices], top_features['Importance_Synthetic'], height=bar_width, label='Synthetic', color='skyblue')
plt.barh(indices, top_features['Importance_Real'], height=bar_width, label='Real', color='salmon')
plt.yticks([i + bar_width/2 for i in indices], top_features['Feature'])
plt.xlabel('Random Forest Feature Importance')
plt.title('Feature importance comparison (Synthetic vs Real)')
plt.legend()
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
In [ ]:
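# Note: the three columns are combined by position, not matched by feature name;
# each list is sorted by a different key (combined, synthetic, and real importance)
comparison_df = pd.DataFrame({
    'Feature': top_features['Feature'].tolist(),
    'RF_Synthetic': summary_df['RF_Importance'].head(25).tolist(),
    'RF_Real': summary_real_df['RF_Importance'].head(25).tolist()
})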
# Thresholds for feature selection
real_thresh = 0.03
synthetic_thresh = 0.01
# Select features based on the thresholds
selected_features = comparison_df[
(comparison_df['RF_Real'] >= real_thresh) &
(comparison_df['RF_Synthetic'] >= synthetic_thresh)
]['Feature'].tolist()
selected_features
Out[ ]:
['process_open_fds_value', 's5c_rx_parse_failed_value', 'process_virtual_memory_bytes_value', 'fivegs_amffunction_rm_reginitsucc_value', 'fivegs_smffunction_sm_sessionnbr_value', 'fivegs_smffunction_sm_qos_flow_nbr_value', 'fivegs_pcffunction_pa_policysmassosucc_value', 'fivegs_smffunction_sm_n4sessionreportsucc_value', 'fivegs_pcffunction_pa_sessionnbr_value', 'fivegs_amffunction_mm_confupdate_value', 'fivegs_amffunction_rm_regemergreq_value', 'fivegs_upffunction_sm_n4sessionreport_value', 'fivegs_amffunction_mm_confupdatesucc_value', 'fivegs_smffunction_sm_pdusessioncreationreq_value']
Selected metrics: Metrics that meet the selection conditions on both the real and the synthetic data.
Selected metrics: Metric selection was performed using Random Forest importance scores on the synthetic and real 5G network datasets. A metric was kept only if its importance was ≥ 0.03 in the real data (indicating real-world relevance) and ≥ 0.01 in the synthetic data (ensuring at least minimal generalization during training). This dual-threshold approach mitigates domain shift: it prioritizes real-world signals while maintaining compatibility with the synthetic training environment.
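The dual-threshold rule can be sketched as follows. The importances are made up and `metric_a`, `metric_b`, `metric_c` are hypothetical names; only the two thresholds come from the text. The sketch joins both score tables on the feature name, which is an assumption about how the rule is meant to be applied, so that each feature is tested against its own two scores.

```python
import pandas as pd

# Hypothetical importances; the thresholds are the ones quoted in the text
synthetic = pd.DataFrame({
    "Feature": ["metric_a", "metric_b", "metric_c"],
    "Importance_Synthetic": [0.140, 0.005, 0.020],
})
real = pd.DataFrame({
    "Feature": ["metric_c", "metric_a", "metric_b"],
    "Importance_Real": [0.010, 0.060, 0.050],
})

REAL_THRESH = 0.03
SYNTHETIC_THRESH = 0.01

# Merging on the feature name pairs each feature with its own two scores,
# regardless of how the two tables are sorted
merged = synthetic.merge(real, on="Feature", how="inner")
keep = (merged["Importance_Real"] >= REAL_THRESH) & (
    merged["Importance_Synthetic"] >= SYNTHETIC_THRESH
)
selected = merged.loc[keep, "Feature"].tolist()
print(selected)
```

Here `metric_b` fails the synthetic threshold and `metric_c` fails the real one, so only `metric_a` is kept.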
In [ ]:
def permut_imp(data, selected_features, label, visualize=False):
    """Calculate Permutation Importance for given features."""
    try:
        X = data[selected_features]
        y = data['current_uc']
        X_scaled = StandardScaler().fit_transform(X)
        model = RandomForestClassifier(n_estimators=100, random_state=32)
        model.fit(X_scaled, y)
        # Importance is measured on the same data the model was fitted on
        result = permutation_importance(model, X_scaled, y, n_repeats=5, random_state=32)
        importance_df = pd.DataFrame({
            'Feature': selected_features,
            f'Permutation_Importance_{label}': result.importances_mean
        }).sort_values(by=f'Permutation_Importance_{label}', ascending=False)
        # Visualize the results
        if visualize:
            plt.figure(figsize=(10, 6))
            plt.barh(importance_df['Feature'], importance_df[f'Permutation_Importance_{label}'], color='skyblue')
            plt.xlabel("Permutation Importance")
            plt.title(f"Permutation Importance ({label})")
            plt.gca().invert_yaxis()
            plt.grid(True)
            plt.tight_layout()
            plt.show()
        return importance_df
    except Exception as e:
        # On any failure, report the error and return an empty table
        print(f"❌ Error in permut_imp for {label}: {e}")
        return pd.DataFrame()
In [41]:
permut_imp(data, selected_features, "Synthetic", True)
Out[41]:
| | Feature | Permutation_Importance_Synthetic |
|---|---|---|
| 2 | process_virtual_memory_bytes_value | 0.085930 |
| 9 | fivegs_amffunction_mm_confupdate_value | 0.078626 |
| 3 | fivegs_amffunction_rm_reginitsucc_value | 0.078430 |
| 6 | fivegs_pcffunction_pa_policysmassosucc_value | 0.070472 |
| 5 | fivegs_smffunction_sm_qos_flow_nbr_value | 0.047195 |
| 13 | fivegs_smffunction_sm_pdusessioncreationreq_value | 0.042757 |
| 0 | process_open_fds_value | 0.035987 |
| 8 | fivegs_pcffunction_pa_sessionnbr_value | 0.021392 |
| 4 | fivegs_smffunction_sm_sessionnbr_value | 0.011509 |
| 1 | s5c_rx_parse_failed_value | 0.000000 |
| 7 | fivegs_smffunction_sm_n4sessionreportsucc_value | 0.000000 |
| 10 | fivegs_amffunction_rm_regemergreq_value | 0.000000 |
| 11 | fivegs_upffunction_sm_n4sessionreport_value | 0.000000 |
| 12 | fivegs_amffunction_mm_confupdatesucc_value | 0.000000 |
In [42]:
permut_imp(real_data, selected_features, "Real", True)
Out[42]:
| | Feature | Permutation_Importance_Real |
|---|---|---|
| 7 | fivegs_smffunction_sm_n4sessionreportsucc_value | 0.036846 |
| 8 | fivegs_pcffunction_pa_sessionnbr_value | 0.034311 |
| 6 | fivegs_pcffunction_pa_policysmassosucc_value | 0.015859 |
| 13 | fivegs_smffunction_sm_pdusessioncreationreq_value | 0.008018 |
| 5 | fivegs_smffunction_sm_qos_flow_nbr_value | 0.006043 |
| 4 | fivegs_smffunction_sm_sessionnbr_value | 0.000884 |
| 11 | fivegs_upffunction_sm_n4sessionreport_value | 0.000796 |
| 9 | fivegs_amffunction_mm_confupdate_value | 0.000059 |
| 0 | process_open_fds_value | 0.000000 |
| 1 | s5c_rx_parse_failed_value | 0.000000 |
| 2 | process_virtual_memory_bytes_value | 0.000000 |
| 3 | fivegs_amffunction_rm_reginitsucc_value | 0.000000 |
| 10 | fivegs_amffunction_rm_regemergreq_value | 0.000000 |
| 12 | fivegs_amffunction_mm_confupdatesucc_value | 0.000000 |
Permutation Importance (Real Data): Permutation importance reveals features that are strong in the real data but may be weakly represented or missing in the synthetic data.
- fivegs_smffunction_sm_n4sessionreportsucc_value has the highest permutation importance in the real data (0.036846), although it scored 0.000000 in the synthetic data, which points to a possible bias in the synthetic dataset.
- process_virtual_memory_bytes_value, which ranked highest in the synthetic data (0.085930), contributes nothing in the real data (0.000000).
- Features such as fivegs_pcffunction_pa_sessionnbr_value and fivegs_pcffunction_pa_policysmassosucc_value have non-zero importance in both datasets, which suggests they are robust across domains.
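The permutation importances above are computed on the same rows the forest was fitted on. A common variant, shown here only as a sketch on generated stand-in data and not as the procedure used in this notebook, evaluates the permutations on a held-out split so that the scores reflect generalization rather than memorization.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; shuffle=False keeps the 3 informative features in columns 0-2
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permute each feature on rows the forest has not seen during fitting
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
print(result.importances_mean.round(3))
```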
In [ ]:
synthetic_perm = permut_imp(data, selected_features, "Synthetic")
real_perm = permut_imp(real_data, selected_features, "Real")
# Check if the permutation importance tables are empty
if synthetic_perm.empty or real_perm.empty:
    raise ValueError("❗ One of the permutation importance tables is empty. Check selected_features or data.")
merged = pd.merge(synthetic_perm, real_perm, on='Feature', how='inner')
merged = merged.merge(summary_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged = merged.rename(columns={'RF_Importance': 'RF_Importance_Synthetic'})
merged = merged.merge(summary_real_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged = merged.rename(columns={'RF_Importance': 'RF_Importance_Real'})
final_features = merged[
(merged['RF_Importance_Synthetic'] >= 0) &
(merged['Permutation_Importance_Real'] >= 0.001)
].sort_values(by='Permutation_Importance_Real', ascending=False)
final_features = final_features['Feature'].tolist()
# Add the log_type and application columns
new_features = ['log_type', 'application']
final_features = final_features + new_features
output = {"features": final_features}
with open('selected_features.json', 'w') as f:
    json.dump(output, f)
output
Out[ ]:
{'features': ['fivegs_smffunction_sm_n4sessionreportsucc_value',
'fivegs_pcffunction_pa_sessionnbr_value',
'fivegs_pcffunction_pa_policysmassosucc_value',
'fivegs_smffunction_sm_pdusessioncreationreq_value',
'fivegs_smffunction_sm_qos_flow_nbr_value',
'log_type',
'application']}
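A later training step can read the saved list back and use it to subset a DataFrame. The sketch below is an assumption about how the file might be consumed: it writes a small stand-in file to a temporary directory (with hypothetical feature names) rather than touching `selected_features.json`, and checks that every listed column is present before selecting.

```python
import json
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the file written by the notebook
    path = os.path.join(tmp, "selected_features.json")
    with open(path, "w") as f:
        json.dump({"features": ["metric_a", "log_type"]}, f)

    # Read the feature list back
    with open(path) as f:
        features = json.load(f)["features"]

# Hypothetical frame with one extra column that is not in the saved list
frame = pd.DataFrame({"metric_a": [1.0], "metric_b": [2.0], "log_type": [0]})

# Fail early with a clear message if the data lacks a selected column
missing = [c for c in features if c not in frame.columns]
if missing:
    raise KeyError(f"Columns missing from the data: {missing}")

X_selected = frame[features]
print(X_selected.columns.tolist())
```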
In [ ]:
def permut_imp_stability(data, selected_features, label, n_runs=10):
    """Calculate the stability of Permutation Importance across multiple runs."""
    X = data[selected_features]
    y = data['current_uc']
    X_scaled = StandardScaler().fit_transform(X)
    importances = []
    # Refit the forest with a different seed in every run
    for i in range(n_runs):
        model = RandomForestClassifier(n_estimators=100, random_state=i)
        model.fit(X_scaled, y)
        result = permutation_importance(model, X_scaled, y, n_repeats=5, random_state=i)
        importances.append(result.importances_mean)
    # Rows are runs, columns are features
    importances = np.array(importances)
    mean_importance = np.mean(importances, axis=0)
    std_importance = np.std(importances, axis=0)
    median_importance = np.median(importances, axis=0)
    stability_df = pd.DataFrame({
        'Feature': selected_features,
        f'PI_Mean_{label}': mean_importance,
        f'PI_Std_{label}': std_importance,
        f'PI_Median_{label}': median_importance
    }).sort_values(by=f'PI_Median_{label}', ascending=False)
    # Visualize the results
    plt.figure(figsize=(10, 6))
    plt.barh(stability_df['Feature'], stability_df[f'PI_Median_{label}'],
             xerr=stability_df[f'PI_Std_{label}'], color='skyblue')
    plt.xlabel("Median Permutation Importance (± std)")
    plt.title(f"Permutation Importance Stability ({label}) across {n_runs} runs")
    plt.gca().invert_yaxis()
    plt.grid(True)
    plt.tight_layout()
    plt.show()
    return stability_df
In [45]:
stability_synthetic = permut_imp_stability(data, selected_features, "Synthetic")
stability_real = permut_imp_stability(real_data, selected_features, "Real")
Final feature selection (cross-domain validated): Based on the combination of Random Forest importance and Permutation Importance on both the synthetic and the real data, we chose features that:
✔ Are informative in the synthetic data (trainable)
✔ Have real-world importance (usable in practice)
Final selection (7 features):
- fivegs_pcffunction_pa_policysmassosucc_value
- fivegs_smffunction_sm_qos_flow_nbr_value
- fivegs_pcffunction_pa_sessionnbr_value
- fivegs_smffunction_sm_pdusessioncreationreq_value
- fivegs_smffunction_sm_n4sessionreportsucc_value
- log_type (dominant in the synthetic data)
- application (dominant in the real data)
With this selection we aim to reduce domain bias and improve robustness when generalizing from training on synthetic data to the real environment.
Final feature selection (cross-domain validated): The final feature selection is based on combining permutation importance from the synthetic and the real data. The goal is a selection that reflects real network behavior while remaining trainable on synthetic data. Only features that showed a consistent informational contribution across both domains were included in the final set. This approach eliminated features that dominate in the synthetic environment but do not represent reality (e.g. process_virtual_memory_bytes_value), which reduces the risk of domain bias.
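One possible way to make "consistent across both domains" operational is to require that a feature's mean permutation importance exceeds its run-to-run standard deviation in both stability tables. This criterion is an assumption introduced here for illustration and is not the rule used to produce the list above; the tables below are made up and only mimic the column layout returned by `permut_imp_stability`.

```python
import pandas as pd

# Hypothetical stability tables shaped like the output of permut_imp_stability
stab_synth = pd.DataFrame({
    "Feature": ["metric_a", "metric_b", "metric_c"],
    "PI_Mean_Synthetic": [0.070, 0.086, 0.000],
    "PI_Std_Synthetic": [0.004, 0.005, 0.000],
})
stab_real = pd.DataFrame({
    "Feature": ["metric_a", "metric_b", "metric_c"],
    "PI_Mean_Real": [0.016, 0.000, 0.037],
    "PI_Std_Real": [0.002, 0.000, 0.003],
})

# Pair the two domains by feature name
both = stab_synth.merge(stab_real, on="Feature", how="inner")

# Assumed criterion: the mean importance must exceed one standard deviation in each domain
informative_synth = both["PI_Mean_Synthetic"] - both["PI_Std_Synthetic"] > 0
informative_real = both["PI_Mean_Real"] - both["PI_Std_Real"] > 0
consistent = both.loc[informative_synth & informative_real, "Feature"].tolist()
print(consistent)
```

In this toy input, `metric_b` matters only in the synthetic table and `metric_c` only in the real one, so `metric_a` is the single feature that passes in both domains.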
References ¶
NGUYEN, Giang. Introduction to Data Science. 1st ed. Bratislava: Slovak University of Technology in Bratislava, 2022. ISBN 978-80-227-5193-3.
Alejopaullier (2024) Make your notebooks look better. https://www.kaggle.com/code/alejopaullier/make-your-notebooks-look-better.
Huang, N., Lu, G. and Xu, D., 2016. A permutation importance-based feature selection method for short-term electricity load forecasting using random forest. Energies, 9(10), p.767. Available at: https://doi.org/10.3390/en9100767